ABSTRACT


Modern video applications call for computationally intensive data processing at very high data rates. To
meet the high-performance/low-cost constraints, a state-of-the-art video processor should be programmable
so that it can perform the various tasks in video applications, while computational power and
manufacturing cost should not be sacrificed in exchange for such flexibility. In this paper, we present a
programmable video co-processor design for numerically intensive front-end video/image communications. The
resulting system is a massively parallel architecture that is capable of performing most low-level
computationally intensive tasks, including FIR/IIR filtering, subband filtering, discrete orthogonal
transforms (DT), adaptive filtering, and motion estimation, for the host processor. An interconnection
network is used to configure the system for the desired data paths. Since the properties of each programmed
function, such as parallelism and pipelinability, have been fully exploited in the design, this
co-processor is as fast as ASIC designs that are optimized for individual applications.
We also show that the system can be easily reconfigured to perform multirate FIR/IIR/DT operations at
negligible hardware overhead. Therefore, we can cope with extremely high-speed data by using the same
processing elements. This feature can also be applied to a low-power implementation of the co-processor,
since the multirate operations can compensate for the increased delay caused by a low supply voltage
without hindering the system performance. The programmable and high-speed properties
of the proposed co-processor design make it very suitable for video-rate applications.


This  work  was  supported  in  part  by  the  ONR  grant  N00014-93-10566  and  the  NSF  NYI  Award 
MIP9457397. 
1 Introduction

Modern communication services such as high-definition TV (HDTV), video-on-demand (VOD), and
PC-based multimedia applications call for computationally intensive data processing at video data rates.
This processing includes low-level tasks such as the discrete cosine transform (DCT) and filtering
(convolution) operations, as well as medium-level tasks such as motion estimation (ME), variable length
coding (VLC), and vector quantization (VQ). All these tasks require millions of basic operations
(additions/multiplications) per second to ensure the real-time performance of those video applications.
As a consequence, the traditional general-purpose programmable digital signal processing (DSP) processor
is not applicable under such a speed constraint.
Although the performance of the DSP processor can be improved by utilizing advanced VLSI technology
and special arithmetic kernels [1] [2], the manufacturing cost as well as the complexity of the design will
increase enormously. This approach is referred to as the technology-driven solution [3]. On the other hand,
a dedicated application-specific integrated circuit (ASIC) chip that is optimally designed for a given
function can handle the demanding computational tasks. However, such an approach will also increase the
manufacturing cost and system area, since a collection of ASIC chips is required for the different tasks in
video applications. Hence we seek a programmable video processor design that has the flexibility of
a general DSP processor while still meeting the stringent speed requirement, as the ASIC chips do.

Recently, it has been observed that, in order to handle signal processing applications efficiently,
the properties of the DSP algorithms should be fully exploited so as to find the parallelism and
pipelinability needed for very high-speed implementations of those algorithms. As an example, the
rotation-based CORDIC processor has been suggested as a basic module to perform FIR and IIR filtering in a
fully pipelined way [4] [5]. Also, an inner-product processing kernel was found to be the common part of
some signal processing operations [3]. These solutions exploit the inherent properties of the algorithms
and are referred to as the algorithm-driven solutions [3].

The  major  goal  of  this  paper  is  to  integrate  the  rotation-based  FIR/IIR  architecture,  Quadrature  Mirror 
Filter  (QMF)  lattice  structure  [6],  discrete  transform  (DT)  architecture  [7]  [8],  adaptive  lattice  filter  [9],  and 
DCT-based  motion  estimation  scheme  [10],  into  one  universal  programmable  architecture.  It  will  serve  as 
a  co-processor  in  the  video  system  to  perform  all  front-end  computationally  intensive  functions  for  the  host 
processor. We will first examine the inherent properties of each function, then find the common
rotation-based computational modules that can serve as the basic processing elements. Hence, this design
is also an algorithm-driven solution for video applications but is much more general-purpose. The resulting system
consists  of  an  array  of  identical  programmable  modules  and  one  programmable  interconnection  network. 
Each  programmable  module  will  act  as  the  basic  computational  module  in  each  programmed  function  by 
setting  suitable  parameters  and  switches.  The  interconnection  network  is  used  to  connect  the  modules  and 
to  combine  the  appropriate  module  outputs  according  to  the  data  paths.  The  proposed  architecture  is  very 
suitable  for  VLSI  implementation  due  to  its  modularity  and  regularity. 

The second goal of this paper is to improve the speed performance of the system based on the multirate
approach [11]. In video signal processing, the major constraint is the processing speed of the video processor.
Such a speed constraint will result in the use of expensive fast multiplier/adder circuits or full-custom design.
Thus, the cost and the design cycle will increase drastically. The multirate approach is a powerful tool
for high-speed/low-power DSP applications [12]. By employing the multirate architecture with a decimation
factor of two, we can process data at, for example, a 100 MHz rate while using only 50 MHz processing
elements. Thus, we can speed up the system at the algorithmic/architectural level without using expensive
high-speed processing elements. We will show that by simply setting new parameters in the programmable
modules and reconfiguring the interconnection network, we can speed up the system by performing the
multirate operations on-the-fly without modifying the overall architecture.

In the last part, we will show how to incorporate adaptive filtering into our co-processor
design. The recursive least-squares (RLS) filter, which is widely used in channel equalization, system
identification, and image restoration, will become another computationally intensive task
in portable video equipment, since wireless communication requires fast adaptation to the highly
non-stationary mobile channel. We will show that, with little modification to the programmable module design,
the proposed co-processor can also perform the QR-decomposition based RLS lattice algorithm (QRD-LSL)
[9] [13, Chap. 18] in a fully pipelined way.

The organization of this paper is as follows: Section 2 presents the basic programmable co-processor
design for FIR, IIR, and QMF filtering and for discrete transforms. In Section 3, the speed-up of the
co-processor design based on the multirate approach is discussed. The incorporation of the QRD-LSL
array into our co-processor design is presented in Section 4, followed by the conclusions.

2  Video  Co-processor  Design  for  the  FIR/QMF/IIR/DT 

In this section, the design of the video co-processor under normal operation (without speed-up) is discussed.
We first examine the basic operations of FIR filtering, the QMF bank, IIR filtering, and discrete
orthogonal transforms (DT) to find the basic computational modules. Later, a universal programmable
module which integrates those basic computational modules in FIR/QMF/IIR/DT is derived. We will
show that, by setting appropriate parameters in the modules and connecting them via a programmable
interconnection network, we are able to perform all functions in the FIR/QMF/IIR/DT in a fully pipelined
way.

2.1  Basic  Module  in  FIR 

The finite impulse response (FIR) filter is widely used in the areas of image processing and equalization. In
addition to the multiply-and-accumulate (MAC) implementation of the filtering operation, an alternative
realization of the FIR filter is the lattice structure shown in Fig.1 [14]. It consists of N basic lattice
sections that are connected in cascade. The advantages of the lattice structure over the MAC
implementation are its robustness to the coefficient quantization effect and its smaller dynamic range,
both due to the orthogonal operation used in each lattice section.
Given an $N$th-order FIR transfer function

$$H(z) = 1 - \sum_{m=0}^{N-1} a_m z^{-(m+1)}, \tag{1}$$

the FIR lattice parameters can be computed as follows [15]:

1. Initialization: $a_m^{(N)} = a_m$, $m = 0, 1, \ldots, N-1$.

2. For $i = N-1, N-2, \ldots, 0$
where the parameters $k_i$, $i = 0, 1, \ldots, N-1$, are known as the reflection coefficients, or the PARtial
CORrelation (PARCOR) coefficients in the theory of linear prediction [16]. After the $k_i$'s are computed, the
lattice section of the FIR filter can be described by


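$$\begin{bmatrix} x_{out} \\ x'_{out} \end{bmatrix} = \begin{bmatrix} 1 & -k_i \\ -k_i & 1 \end{bmatrix} \begin{bmatrix} x_{in} \\ x'_{in} \end{bmatrix} \tag{3}$$

or equivalently

1. For $|k_i| < 1$,

$$\begin{bmatrix} x_{out} \\ x'_{out} \end{bmatrix} = \begin{bmatrix} \sqrt{1-k_i^2} & 0 \\ 0 & \sqrt{1-k_i^2} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{1-k_i^2}} & \frac{-k_i}{\sqrt{1-k_i^2}} \\ \frac{-k_i}{\sqrt{1-k_i^2}} & \frac{1}{\sqrt{1-k_i^2}} \end{bmatrix} \begin{bmatrix} x_{in} \\ x'_{in} \end{bmatrix} = \begin{bmatrix} \sqrt{1-k_i^2} & 0 \\ 0 & \sqrt{1-k_i^2} \end{bmatrix} \begin{bmatrix} \cosh\theta_i & \sinh\theta_i \\ \sinh\theta_i & \cosh\theta_i \end{bmatrix} \begin{bmatrix} x_{in} \\ x'_{in} \end{bmatrix} \tag{4}$$

with

$$\theta_i = \tanh^{-1}(-k_i). \tag{5}$$

2. For $|k_i| > 1$,

$$\begin{bmatrix} x_{out} \\ x'_{out} \end{bmatrix} = \begin{bmatrix} -\mathrm{sign}(k_i)\sqrt{k_i^2-1} & 0 \\ 0 & -\mathrm{sign}(k_i)\sqrt{k_i^2-1} \end{bmatrix} \begin{bmatrix} \frac{|k_i|}{\sqrt{k_i^2-1}} & \frac{-\mathrm{sign}(k_i)}{\sqrt{k_i^2-1}} \\ \frac{-\mathrm{sign}(k_i)}{\sqrt{k_i^2-1}} & \frac{|k_i|}{\sqrt{k_i^2-1}} \end{bmatrix} \begin{bmatrix} x'_{in} \\ x_{in} \end{bmatrix} = \begin{bmatrix} -\mathrm{sign}(k_i)\sqrt{k_i^2-1} & 0 \\ 0 & -\mathrm{sign}(k_i)\sqrt{k_i^2-1} \end{bmatrix} \begin{bmatrix} \cosh\theta_i & \sinh\theta_i \\ \sinh\theta_i & \cosh\theta_i \end{bmatrix} \begin{bmatrix} x'_{in} \\ x_{in} \end{bmatrix} \tag{6}$$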

System  Architecture  of  A  Massively  Parallel  Programmable  Video  Co-Processor 




where $\mathrm{sign}(k_i)$ denotes the sign of $k_i$ and $\theta_i$ is defined by

$$\theta_i = \tanh^{-1}(-1/k_i). \tag{7}$$

In general, (3) is preferred for the implementation using general-purpose multipliers. In our unified design,
we will employ (4) and (6), which can be implemented by using the CORDIC processor in hyperbolic mode
[5] together with two scaling multipliers, to realize the basic modules (see Fig.1(a)(b)). Note that we need to
swap the two inputs for the case $|k_i| > 1$ since the input vector is inverted in (6). These two basic modules
constitute the FIR lattice structure as shown in Fig.1(c).
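The factorizations (4)-(7) can be checked numerically. The following Python sketch computes the scaling factor and hyperbolic angle for one reflection coefficient and applies the scaled hyperbolic rotation, swapping the inputs when $|k_i| > 1$. It assumes that the lattice section maps $(x_{in}, x'_{in})$ to $(x_{in} - k_i x'_{in},\; -k_i x_{in} + x'_{in})$, which is the form implied by (4); the function names are illustrative and do not come from the paper.

```python
import math

def fir_module_parameters(k):
    """Return (f, theta, swap) for one FIR lattice module, following (4)-(7)."""
    if abs(k) < 1.0:
        # Case |k| < 1: scale by sqrt(1 - k^2), angle tanh^-1(-k), no input swap.
        return math.sqrt(1.0 - k * k), math.atanh(-k), False
    if abs(k) > 1.0:
        # Case |k| > 1: scale by -sign(k) sqrt(k^2 - 1), angle tanh^-1(-1/k), swap inputs.
        f = -math.copysign(1.0, k) * math.sqrt(k * k - 1.0)
        return f, math.atanh(-1.0 / k), True
    raise ValueError("|k| = 1 has no finite hyperbolic rotation angle")

def fir_module(k, x_in, xp_in):
    """Scaling multipliers followed by a hyperbolic rotation."""
    f, theta, swap = fir_module_parameters(k)
    a, b = (xp_in, x_in) if swap else (x_in, xp_in)
    a, b = f * a, f * b
    ch, sh = math.cosh(theta), math.sinh(theta)
    return ch * a + sh * b, sh * a + ch * b

def direct_section(k, x_in, xp_in):
    """Assumed multiplier form of the lattice section."""
    return x_in - k * xp_in, -k * x_in + xp_in
```

For any $k_i$ with $|k_i| \neq 1$, `fir_module` and `direct_section` should agree to within rounding error, which is the sense in which the two forms are equivalent.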

2.2  Basic  Module  in  QMF 

The Quadrature Mirror Filter (QMF) plays a key role in image compression and subband coding [17] [18].
Recently, the two-channel paraunitary QMF lattice was proposed [6]. It possesses all the advantages of the
lattice structure, such as robustness to coefficient quantization, smaller dynamic range, and modularity. Such
properties make it preferable to the MAC-based realization when the filter bank is implemented using
fixed-point arithmetic. Fig.2 shows the QMF lattice structure, where part (c) is the analysis bank and part (d) is the
synthesis bank. We can see that the QMF lattice is very similar to the FIR lattice except that the inputs of
the lattice become the decimated sequences of the input signal. As a consequence, the module in Fig.1(b)
can be readily used for the QMF lattice by setting the scaling multipliers equal to one and the CORDIC
processor to the circular mode.

It has been shown in [6] that every two-channel (real-coefficient, FIR) paraunitary QMF bank can be
represented using the QMF lattice. Given a pre-designed power-symmetric FIR analysis filter $H_0(z)$ with
unit sample response $h_0(n)$, $n = 0, 1, \ldots, N$, we can first find the unit sample response of the other analysis
filter $H_1(z)$ by

$$h_1(n) = (-1)^n h_0(N-n). \tag{8}$$
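As a small illustration of (8), the sketch below builds $h_1(n)$ from $h_0(n)$ by time reversal and sign alternation. The function name and the four-tap input are hypothetical and chosen only to make the index arithmetic visible; they are not a designed power-symmetric filter.

```python
def mirror_filter(h0):
    """Return h1(n) = (-1)^n * h0(N - n) for n = 0..N, where N = len(h0) - 1."""
    N = len(h0) - 1
    return [(-1) ** n * h0[N - n] for n in range(N + 1)]

h0 = [1.0, 2.0, 3.0, 4.0]      # arbitrary taps, N = 3
h1 = mirror_filter(h0)          # [4.0, -3.0, 2.0, -1.0]

# For odd N the two responses are orthogonal, because the terms for n and
# N - n in sum_n h0(n) h1(n) cancel in pairs.
inner_product = sum(a * b for a, b in zip(h0, h1))
```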

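Then the rotation angle $\theta_i$ in each QMF module can be computed by [11, Chap. 6]:

1. Initialization: $H_0^{(J)}(z) = H_0(z)$ and $H_1^{(J)}(z) = H_1(z)$ with $N = 2J + 1$.

2. For $i = J, J-1, \ldots, 0$

$$\begin{aligned} (1+\alpha_i^2)\, H_0^{(i-1)}(z) &= H_0^{(i)}(z) - \alpha_i H_1^{(i)}(z), \\ (1+\alpha_i^2)\, z^{-2} H_1^{(i-1)}(z) &= \alpha_i H_0^{(i)}(z) + H_1^{(i)}(z), \\ \theta_i &= \tan^{-1}\alpha_i, \end{aligned} \tag{9}$$

end

The coefficient $\alpha_i$ is computed by setting the coefficient of the highest power of $z^{-1}$ in $H_0^{(i)}(z) - \alpha_i H_1^{(i)}(z)$ equal to zero.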
2.3 Basic Module in IIR

Next we consider the basic module for infinite impulse response (IIR) filtering. The lattice structure
for an IIR system (all-pole and ARMA) [14] is shown in Fig.3. Although the basic lattice module in IIR
is similar to the one in the FIR lattice, the opposite data flow in the IIR module makes it difficult to
incorporate into our unified design. Besides, the modularity of the lattice structure no longer exists if we
want to implement a general IIR (ARMA) filter (see Fig.3(b)). Therefore, we try to find an IIR module that
has a data path similar to that of the FIR/QMF lattice while retaining the modularity of the design.


2.3.1  Second-order  IIR  Lattice  Structure 

Fig.4 shows the lattice structure that can be used to realize a second-order IIR filter. It is also known as the
"coupled form" of the second-order IIR filter, which is robust to coefficient quantization error under
fixed-point arithmetic [15, Chap. 6]. It can be shown that the transfer functions of the two outputs are given
by


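$$H_0(z) = \frac{Y_0(z)}{X(z)} = \frac{r(k_0\cos\theta + k_1\sin\theta) - r^2 k_0 z^{-1}}{1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2}}, \tag{10}$$

$$H_1(z) = \frac{Y_1(z)}{X(z)} = \frac{r(k_1\cos\theta - k_0\sin\theta) - r^2 k_1 z^{-1}}{1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2}}. \tag{11}$$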
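The transfer functions (10) and (11) can be reproduced by a short time-domain simulation. The sketch below assumes that the second-order module scales the input by $k_0$ and $k_1$, adds the delayed module outputs, rotates the result by $\theta$, and multiplies by $r$, i.e. $\mathbf{y}(n) = r R(\theta)[\mathbf{k}\,x(n) + \mathbf{y}(n-1)]$. This update order is a reading of the module description rather than something taken from Fig.4, and the function names are illustrative.

```python
import math

def coupled_form_impulse_response(k0, k1, r, theta, n_samples):
    """Impulse response (y0, y1) of the assumed rotation-based second-order module."""
    c, s = math.cos(theta), math.sin(theta)
    y0 = y1 = 0.0
    out = []
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0
        u0, u1 = k0 * x + y0, k1 * x + y1                        # scale input, add feedback
        y0, y1 = r * (c * u0 + s * u1), r * (-s * u0 + c * u1)   # rotate, then scale by r
        out.append((y0, y1))
    return out

def direct_form_impulse_response(b0, b1, a1, a2, n_samples):
    """Impulse response of (b0 + b1 z^-1) / (1 + a1 z^-1 + a2 z^-2)."""
    h = []
    for n in range(n_samples):
        v = (b0 if n == 0 else 0.0) + (b1 if n == 1 else 0.0)
        if n >= 1:
            v -= a1 * h[n - 1]
        if n >= 2:
            v -= a2 * h[n - 2]
        h.append(v)
    return h

k0, k1, r, theta = 0.7, -0.4, 0.9, 0.6   # arbitrary test parameters
c, s = math.cos(theta), math.sin(theta)
simulated = coupled_form_impulse_response(k0, k1, r, theta, 20)
h0_ref = direct_form_impulse_response(r * (k0 * c + k1 * s), -r * r * k0, -2 * r * c, r * r, 20)
h1_ref = direct_form_impulse_response(r * (k1 * c - k0 * s), -r * r * k1, -2 * r * c, r * r, 20)
```

Under the stated assumption, the two simulated outputs coincide with the impulse responses of (10) and (11), so the poles are set by $r$ and $\theta$ alone while $k_0$ and $k_1$ only shape the numerators.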


Now  given  an  even-order  real-coefficient  IIR  (ARMA)  filter  H(z),  we  can  first  rewrite  it  in  the  cascade  form: 


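$$H(z) = K \prod_{i=0}^{N/2-1} H_i(z), \tag{12}$$

where $K$ is a scaling constant¹ and each subfilter $H_i(z)$ is of the form

$$\begin{aligned} H_i(z) &= \frac{1 + c_i z^{-1} + d_i z^{-2}}{1 + a_i z^{-1} + b_i z^{-2}} \\ &= \frac{1}{1 + a_i z^{-1} + b_i z^{-2}} + z^{-1}\,\frac{c_i + d_i z^{-1}}{1 + a_i z^{-1} + b_i z^{-2}} \\ &= A_{i,0}(z) + z^{-1} A_{i,1}(z). \end{aligned} \tag{13}$$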


Comparing (10) and (11) with (13), we see that $A_{i,0}(z)$ and $A_{i,1}(z)$ can be realized by either $H_0(z)$ or $H_1(z)$
with appropriate settings of the parameters $k_0$, $k_1$, $r$, and $\theta$. The conversion of the parameters is derived in
the Appendix, where $H_0(z)$ is chosen for the realization.

Now based on (12) and (13), we can realize $H(z)$ using the signal diagram depicted in Fig.5, in which
each stage performs the filtering for $H_i(z)$ and all $A_{i,0}$, $A_{i,1}$, $i = 0, 1, \ldots, N/2-1$, are realized by the
second-order IIR module in Fig.4.


¹ Here, we have assumed that $K = 1$. This simple scaling operation can be done by the host processor after it collects the
outputs  from  the  video  co-processor. 
2.4 Basic Module in Discrete Transforms

Performing  the  discrete  transforms  (DT)  based  on  the  rotation  circuit  has  been  extensively  studied  in  the 
literature  [5].  Recently,  Frantzeskakis  et  al.  [7]  [8]  proposed  a  unified  rotation-based  architecture  for  the  DT. 
By  exploiting  the  “shift  property”  and  “periodicity  property”  of  the  orthogonal  transforms,  the  authors 
have  shown  that  any  orthogonal  transform  can  be  implemented  using  a  time-recursive  architecture.  In  the 
following,  we  will  first  review  the  time-recursive  MLT  architecture  in  [7]  [8].  Later,  we  will  show  how  this 
MLT  architecture  can  be  used  to  realize  most  of  the  orthogonal  transforms. 


2.4.1 Time-Recursive MLT Architecture

The  Modulated  Lapped  Transform  (MLT)  is  defined  as  [19]: 


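$$X_{MLT}(k) = C_k \sqrt{\frac{2}{N}} \sum_{n=0}^{2N-1} \sin\left[\frac{\pi}{2N}\left(n + \frac{1}{2}\right)\right] \cos\left[\frac{\pi}{N}\left(k + \frac{1}{2}\right)\left(n + \frac{1}{2} + \frac{N}{2}\right)\right] x(n) \tag{14}$$

for $k = 0, 1, \ldots, N-1$, where $C_k = (-1)^{(k+2)/2}$ if $k$ is even, and $C_k = (-1)^{(k-1)/2}$ if $k$ is odd. After some
algebraic manipulations, the MLT can be decomposed into

$$X_{MLT}(k) = -C_k \left[ X_c(k+1) + X_s(k) \right] \tag{15}$$

where

$$X_c(k) \triangleq \beta \sum_{n=0}^{L-1} \cos[(2n+1)\omega_k + \eta_k]\, x(n), \tag{16}$$

$$X_s(k) \triangleq \beta \sum_{n=0}^{L-1} \sin[(2n+1)\omega_k + \eta_k]\, x(n), \tag{17}$$

with block size $L = 2N$ and

$$\beta \triangleq \frac{1}{\sqrt{2N}}, \quad \omega_k \triangleq \frac{\pi k}{2N}, \quad \text{and} \quad \eta_k \triangleq \frac{\pi}{2}\left(k + \frac{1}{2}\right). \tag{18}$$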


Equation (15) describes how the two functions, $X_c(k)$ and $X_s(k)$, are combined to obtain the MLT
coefficients. In the following discussions, we shall refer to (15) as the combination function.
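The decomposition (15)-(18) can be confirmed numerically against the definition (14). The Python sketch below evaluates both sides for an arbitrary test block; the function names are illustrative, and the sign factor in front of (14) and (15) is taken as $(-1)^{(k+2)/2}$ for even $k$ and $(-1)^{(k-1)/2}$ for odd $k$. Since the same factor multiplies both sides, the comparison does not depend on that particular choice.

```python
import math

def mlt_sign(k):
    """Sign factor assumed in (14) and (15)."""
    return (-1) ** ((k + 2) // 2) if k % 2 == 0 else (-1) ** ((k - 1) // 2)

def mlt_direct(x, N):
    """MLT coefficients evaluated from the definition (14); len(x) must be 2N."""
    out = []
    for k in range(N):
        acc = 0.0
        for n in range(2 * N):
            window = math.sin(math.pi / (2 * N) * (n + 0.5))
            kernel = math.cos(math.pi / N * (k + 0.5) * (n + 0.5 + N / 2))
            acc += window * kernel * x[n]
        out.append(mlt_sign(k) * math.sqrt(2.0 / N) * acc)
    return out

def xc_xs(x, N, k):
    """X_c(k) and X_s(k) of (16)-(17) with the MLT parameters (18)."""
    beta = 1.0 / math.sqrt(2 * N)
    omega = math.pi * k / (2 * N)
    eta = math.pi / 2 * (k + 0.5)
    xc = beta * sum(math.cos((2 * n + 1) * omega + eta) * x[n] for n in range(2 * N))
    xs = beta * sum(math.sin((2 * n + 1) * omega + eta) * x[n] for n in range(2 * N))
    return xc, xs

def mlt_decomposed(x, N):
    """MLT coefficients obtained through the combination function (15)."""
    return [-mlt_sign(k) * (xc_xs(x, N, k + 1)[0] + xc_xs(x, N, k)[1]) for k in range(N)]

N = 8
x = [math.sin(0.7 * n) + 0.1 * n for n in range(2 * N)]   # arbitrary test block
max_diff = max(abs(a - b) for a, b in zip(mlt_direct(x, N), mlt_decomposed(x, N)))
```

The agreement follows from the product-to-sum identity for $\sin A \cos B$ together with $\eta_{k+1} = \eta_k + \pi/2$, which turns one of the two resulting sine terms into $-X_c(k+1)$.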

In  [7]  [8],  a  rotation-based  module  was  derived  for  the  dual  generation  of  Xc(k)  and  Xs(k)  as  depicted 
in  Fig.6(a),  where  the  scaling  multipliers  and  the  rotation  operation  are  given  by 


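$$\mathbf{f}_k = \begin{bmatrix} f_{0,k} \\ f_{1,k} \end{bmatrix} = \begin{bmatrix} \beta\cos((2L+1)\omega_k + \eta_k) \\ \beta\sin((2L+1)\omega_k + \eta_k) \end{bmatrix} \tag{19}$$

and

$$R_k = \begin{bmatrix} \cos\theta_k & \sin\theta_k \\ -\sin\theta_k & \cos\theta_k \end{bmatrix} = \begin{bmatrix} \cos(2\omega_k) & \sin(2\omega_k) \\ -\sin(2\omega_k) & \cos(2\omega_k) \end{bmatrix}, \tag{20}$$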

respectively.  The  rotation-based  module  works  in  a  serial-input-parallel-output  (SIPO)  way:  the  block  input 
data  is  fed  serially  into  the  module.  After  the  updating  of  the  last  datum  is  completed,  the  values  of  Xc(k) 
and  Xs(k)  in  (16)  and  (17)  are  available  at  the  module  outputs. 
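The serial-input-parallel-output behaviour can be sketched in a few lines of Python. The model below assumes that each module scales the incoming sample by the two multipliers in (19), adds the delayed module outputs, and then applies the rotation (20); this update order is an interpretation of the description, not a transcription of Fig.6(a), and the function names are illustrative.

```python
import math

def sipo_module(x, beta, omega, eta):
    """Feed a block serially through one assumed rotation module; return (X_c, X_s)."""
    L = len(x)
    phi = (2 * L + 1) * omega + eta
    f0, f1 = beta * math.cos(phi), beta * math.sin(phi)     # scaling multipliers, (19)
    c, s = math.cos(2 * omega), math.sin(2 * omega)         # rotation by theta = 2*omega, (20)
    y0 = y1 = 0.0                                           # delay elements start at zero
    for sample in x:
        u0, u1 = f0 * sample + y0, f1 * sample + y1         # add delayed outputs
        y0, y1 = c * u0 + s * u1, -s * u0 + c * u1          # rotate
    return y0, y1

def direct_sums(x, beta, omega, eta):
    """X_c and X_s evaluated directly from (16) and (17)."""
    xc = beta * sum(math.cos((2 * n + 1) * omega + eta) * v for n, v in enumerate(x))
    xs = beta * sum(math.sin((2 * n + 1) * omega + eta) * v for n, v in enumerate(x))
    return xc, xs

# MLT parameters (18) for N = 4 (block size L = 8) and k = 3, with an arbitrary block.
N, k = 4, 3
block = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]
params = (1.0 / math.sqrt(2 * N), math.pi * k / (2 * N), math.pi / 2 * (k + 0.5))
recursive = sipo_module(block, *params)
reference = direct_sums(block, *params)
```

In this model the sample $x(n)$ is rotated $L - n$ times by $-2\omega_k$ after being injected at angle $(2L+1)\omega_k + \eta_k$, so it ends up at angle $(2n+1)\omega_k + \eta_k$, which is exactly the kernel of (16) and (17).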

The  aforementioned  module  can  be  used  as  a  basic  building  block  to  implement  MLT  based  on  (15). 
Fig.6(b)  illustrates  the  overall  time-recursive  MLT  architecture  for  the  case  N  =  8.  It  consists  of  two  parts: 
One  is  the  module  array  which  computes  Xc{k)  and  Xs(k),  k  =  0, 1, . . . ,  N  -  1,  in  parallel.  The  other  is 
the  interconnection  network  which  selects  and  combines  the  array  outputs  to  generate  the  MLT  coefficients 
according  to  the  combination  function  defined  in  (15). 

In [12], it has been observed that the MLT architecture in Fig.6 can realize most existing discrete
sinusoidal transforms by setting appropriate parameters and defining the combination functions. For example,
$X_c(k)$ in (16) is equivalent to the DCT by setting

$$L = N, \quad \beta = c_k, \quad \omega_k = \frac{k\pi}{2N}, \quad \text{and} \quad \eta_k = 0, \tag{21}$$

where

$$c_k = \begin{cases} \sqrt{1/N}, & \text{if } k = 0 \\ \sqrt{2/N}, & \text{otherwise} \end{cases} \tag{22}$$

is the scaling factor of the DCT/IDCT. As a result, the MLT module array in Fig.6 can compute the DCT
in parallel. The other example is the DFT with real-valued inputs. With the following parameter setting

$$L = N, \quad \beta = \frac{1}{\sqrt{N}}, \quad \omega_k = \frac{\pi k}{N}, \quad \text{and} \quad \eta_k = -\omega_k, \tag{23}$$

(16) and (17) become

$$X_c(k) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} \cos\left(\frac{2\pi}{N}kn\right) x(n), \quad X_s(k) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} \sin\left(\frac{2\pi}{N}kn\right) x(n), \tag{24}$$

which are the real part and the imaginary part of the DFT, respectively. The DHT can be computed using
the same parameter setting as the DFT except that the interconnection network in Fig.6(b) performs

$$X_{DHT}(k) = X_c(k) + X_s(k). \tag{25}$$

The parameter settings as well as the corresponding combination functions for most DT are summarized in
Table 1.
2.5 Unified Module Design

From Fig.1, Fig.2, Fig.4, and Fig.6, we observe that all the architectures share a common computational
module, with only minor differences in the data path, the module parameters (multiplier coefficients
and rotation angle), and the way the modules are connected. We are thus motivated to integrate those basic
modules into one universal programmable module.

Fig.7 shows the proposed programmable module. It consists of six switches, four scaling multipliers, and
one rotation circuit. The switch set $S = [s_0 s_1 s_2 s_3 s_4 s_5]$ controls the data path inside the module. The switch
pair $s_0$ and $s_1$ selects the input from either $in_i$ or $in'_i$: with $s_0 s_1 = 00$, $in_i$ becomes the common input of the
lattice, which is required in the first stage of the FIR and in the IIR module. Using $s_0 s_1 = 10$, we can swap the
inputs for the FIR lattice when $|k_i| > 1$. Switches $s_2$ and $s_3$ decide whether the delay element is used: with
$s_2 s_3 = 01$, the lower-left delay element is included in the data path, which is required in the FIR/QMF
lattice (except the first stage in QMF banks). With the setting $s_2 s_3 = 11$, the delay element in Fig.5 can be
incorporated into the module $A_{i,1}$. Therefore, we do not need to implement it explicitly in the IIR filtering
operation. The last switch pair is $s_4$ and $s_5$. They control the two feedback paths in the module: when
$s_4 s_5 = 11$, the delayed module outputs are added to the current inputs, as in the IIR and DT cases. The
setting $s_4 s_5 = 00$ disconnects the feedback paths. The scaling multipliers and the rotation circuit are
used in all basic modules of the FIR/QMF/IIR/DT. The parameters $f_i$ and $\theta_i$ can be determined
from our discussions in Section 2.1-Section 2.4. The two extra multipliers $r_i$ at the outputs of the rotation
circuit are required if we want to incorporate the IIR filtering function into this universal design. The complete
settings of the programmable module for the FIR/QMF/IIR/DT are listed in Table 2.

2.6  Video  Co-processor  Design 

Based on the above programmable module, we are ready to design the video co-processor that is capable of
performing a parallel implementation of any function in the FIR/QMF/IIR/DT. Fig.8 shows the video
co-processor architecture under the FIR mode. It consists of two parts: One is the programmable module array
with P identical programmable modules. The other is the programmable interconnection network that connects
those programmable modules according to the data paths. In the FIR/QMF/IIR, the data are processed in
a serial-input-serial-output (SISO) way. Hence, the programmable modules need to be cascaded for those
operations. For example, the FIR modules can be connected by setting the interconnection network as
shown in Fig.8. The connections of the IIR modules can also be achieved using the network setting in Fig.9.
On the other hand, the DT architecture in Section 2.4 performs the block transforms in a SIPO way. The
interconnection network will be configured according to the combination functions defined in Table 1. The
detailed settings of the interconnection network used in this paper (Type I-IX) are described in Table 3.

The operation of the co-processor is as follows: In the initialization mode, the host processor computes
all the necessary parameters $f_i$, $r_i$, $\theta_i$ according to the function type (FIR/QMF/IIR/DT) and the function
to be performed. In general, the functions to be performed are determined beforehand. Hence, all the
parameters can be computed in advance so that the host processor can find the necessary parameters through
table look-up, which reduces the set-up time in this mode. Next, the host processor reconfigures the
interconnection network according to the function type.

Once the video co-processor is initialized, it enters the execution mode. In the applications of FIR/IIR/QMF,
the host processor continuously feeds the data sequence into the co-processor. After the first output data is
ready, the processor can collect the filtering outputs in a fully pipelined way. In the block DT application,
the block input data is fed into the co-processor serially. After the last datum enters the unified module
array, the evaluations of $X_c(k)$ and $X_s(k)$ in (16) and (17) are complete. Then the interconnection
network will combine the module outputs according to the combination function defined in Table 1, and the
transform coefficients can be obtained in parallel at the outputs of the network. At the same time, the host
processor will reset all internal registers (delay elements) of the programmable modules to zero so that the
next block transform can be conducted immediately.

The real-time processing speed as well as the programmable feature of this design make it very attractive
for video-rate applications. The programmable feature can significantly reduce the hardware cost compared
to the ASIC-based implementations. Meanwhile, we do not trade the processing speed for this flexibility,
since all operations are performed in a parallel and fully pipelined way as in the ASIC implementation.
Moreover, the resulting system is modular and regular, and hence is suitable for VLSI implementation.

2.7  Design  Examples 

In  what  follows,  we  will  use  some  design  examples  to  demonstrate  how  to  convert  a  given  system  specification 
to the parameters used in the programmable modules. The orders of the numerator and the denominator
in the IIR (ARMA) filter are restricted to be even so that we can perform all the necessary decomposition.
Here,  a  10-module  co-processor  is  used  to  carry  out  the  given  function.  As  a  result,  the  maximum  order  of 
the  FIR/IIR/QMF  is  10  and  the  transform  size  of  the  DT  is  also  limited  to  10.  For  the  DT,  we  will  use  an 
8-point  DCT  as  an  example  due  to  its  prevalence  in  the  application  of  transform  coding. 

2.7.1  FIR  Filtering 

Given  the  FIR  transfer  function 

$$\begin{aligned} H(z) = {} & 1 - 0.8843z^{-1} - 0.1327z^{-2} - 1.1219z^{-3} + 0.5328z^{-4} - 0.8882z^{-5} \\ & + 0.1038z^{-6} - 0.3786z^{-7} + 0.2195z^{-8} - 0.1094z^{-9}, \end{aligned} \tag{26}$$

we first apply (1)-(2) to compute all PARCOR coefficients:

$k_0 = -0.4472$, $k_1 = -0.6917$, $k_2 = -0.5865$, $k_3 = -4.1573$, $k_4 = 1.1595$,

$k_5 = 0.2655$, $k_6 = 0.2942$, $k_7 = -0.1243$, $k_8 = 0.1094$.

Then all parameters of each module, such as $f_i$ and $\theta_i$, can be found by using (4)-(7). The complete settings
are listed in Table 4(a).
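The conversion from filter coefficients to PARCOR coefficients can be sketched with the standard step-down (backward Levinson) recursion. This is offered as an assumption about what steps (1)-(2) compute, with the sign convention chosen so that a lattice section of the form $[1, -k_i;\, -k_i, 1]$ is implied; the function name is hypothetical.

```python
def parcor_from_fir(c):
    """Step-down recursion.

    c = [c_1, ..., c_N] are the coefficients of H(z) = 1 + sum_j c_j z^-j.
    Returns [k_0, ..., k_{N-1}] under the assumed sign convention k = -(last coefficient).
    """
    c = list(c)
    ks = []
    while c:
        n = len(c)                      # current polynomial order
        k = -c[-1]
        d = 1.0 - k * k
        if abs(d) < 1e-12:
            raise ValueError("|k| = 1: the step-down recursion breaks down")
        # Reduce the order by one: c_j <- (c_j + k * c_{n-j}) / (1 - k^2), j = 1..n-1
        c = [(c[j] + k * c[n - 2 - j]) / d for j in range(n - 1)]
        ks.append(k)
    return ks[::-1]

# Coefficients of z^-1 ... z^-9 in (26)
h26 = [-0.8843, -0.1327, -1.1219, 0.5328, -0.8882, 0.1038, -0.3786, 0.2195, -0.1094]
k = parcor_from_fir(h26)
```

For the coefficients in (26), the last three values returned by this sketch come out as approximately 0.1094, -0.1243, and 0.2942, in agreement with $k_8$, $k_7$, and $k_6$ listed above. Stages with $|k_i| > 1$ are handled without modification, since the recursion only fails when $|k_i| = 1$.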
2.7.2 QMF Filtering

Suppose  that  the  pre-designed  power-symmetric  low-pass  filter  described  in  [11,  Example  5.3.2]  is  used  for 
the  QMF  filtering.  We  have  the  analysis  filter 

$$H_0(z) = \sum_{n=0}^{N-1} h_0(n) z^{-n} \tag{27}$$

with

$h_0(0) = 0.1605$, $h_0(1) = 0.4156$, $h_0(2) = 0.4592$, $h_0(3) = 0.1487$,

$h_0(4) = -0.1642$, $h_0(5) = -0.1245$, $h_0(6) = 0.0825$, $h_0(7) = 0.0888$,

$h_0(8) = -0.0508$, $h_0(9) =$ [illegible], $h_0(10) = 0.0352$, $h_0(11) = 0.0399$,

$h_0(12) = -0.0256$, $h_0(13) = -0.0244$, $h_0(14) = 0.0186$, $h_0(15) = 0.0135$,

$h_0(16) = -0.0131$, $h_0(17) = -0.0074$, $h_0(18) = 0.0129$, $h_0(19) = -0.0050$.

We can go through (8)-(9) to find all $\theta_i$'s in the modules, and the results are shown in Table 4(b).


2.7.3  IIR  (ARMA)  Filtering 


Given the IIR (ARMA) filter

$$H(z) = \frac{1 + \sum_{i=1}^{M} p_i z^{-i}}{1 + \sum_{i=1}^{N} q_i z^{-i}} \tag{28}$$

with $M = 4$, $N = 10$, and

$p_1 = -1.7314$, $p_2 = 1.6788$, $p_3 = -0.7913$, $p_4 = 0.2304$,

$q_1 = 0.4036$, $q_2 = 1.3227$, $q_3 = 0.2376$, $q_4 = 1.1558$, $q_5 = 0.0047$,

$q_6 = 0.6950$, $q_7 = -0.0733$, $q_8 = 0.2735$, $q_9 = -0.0542$, $q_{10} = 0.0788$,

we first rewrite it in cascade form:

$$\begin{aligned} H(z) = {} & \frac{1 - 1.1314z^{-1} + 0.6400z^{-2}}{1 - 0.9192z^{-1} + 0.4225z^{-2}} \times \frac{1 - 0.6000z^{-1} + 0.3600z^{-2}}{1 - 0.7500z^{-1} + 0.5625z^{-2}} \\ & \times \frac{1}{1 + 0.8000z^{-1} + 0.6400z^{-2}} \times \frac{1}{1 + 1.2728z^{-1} + 0.8100z^{-2}} \times \frac{1}{1 + 0.6400z^{-2}}. \end{aligned} \tag{29}$$


Following  the  steps  described  in  (49)-(56),  we  can  find  the  parameters  used  in  each  2nd-order  subfilter.  The 
corresponding  settings  are  in  Table  4(c). 
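The cascade form (29) can be cross-checked against the coefficients of (28) by multiplying the second-order sections back together. In the sketch below, the second and fifth denominator sections are partly illegible in the source; the values used for them (a $-0.7500z^{-1}$ term, and a fifth section with no $z^{-1}$ term and $0.6400z^{-2}$) are inferred because they reproduce the listed $q_i$. The helper name is illustrative.

```python
def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

numerator_sections = [
    [1.0, -1.1314, 0.6400],
    [1.0, -0.6000, 0.3600],
]
denominator_sections = [
    [1.0, -0.9192, 0.4225],
    [1.0, -0.7500, 0.5625],   # sign of the z^-1 term inferred
    [1.0, 0.8000, 0.6400],
    [1.0, 1.2728, 0.8100],
    [1.0, 0.0, 0.6400],       # inferred section
]

num = [1.0]
for section in numerator_sections:
    num = poly_mul(num, section)
den = [1.0]
for section in denominator_sections:
    den = poly_mul(den, section)

p = [-1.7314, 1.6788, -0.7913, 0.2304]
q = [0.4036, 1.3227, 0.2376, 1.1558, 0.0047,
     0.6950, -0.0733, 0.2735, -0.0542, 0.0788]
num_error = max(abs(a - b) for a, b in zip(num[1:], p))
den_error = max(abs(a - b) for a, b in zip(den[1:], q))
```

Both errors stay at the level of the four-digit rounding of the printed coefficients, which supports the pairing of sections used in (29).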


2.7.4  Block  DCT 

For the block 8-point DCT, we can calculate $f_{0,i}$, $f_{1,i}$, and $\theta_i$ for each module using (19)-(22). The settings
are listed in Table 4(d).
3 Speed-Up of the Video Co-processor Architecture

In  video  signal  processing,  the  fundamental  bottleneck  is  the  processing  speed  of  the  processing  elements. 
Although  the  above  mentioned  co-processor  architecture  has  fully  exploited  the  parallelism  and  pipelinability 
for  each  programmed  function,  the  input  data  rate  is  still  limited  by  the  speed  of  the  adders  and  multipliers 
inside  the  programmable  module.  In  the  video-rate  applications  such  as  HDTV,  this  speed  constraint  will 
result  in  the  use  of  expensive  high-speed  multiplier/adder  circuits  or  full-custom  design.  Thus,  the  cost  as 
well  as  the  design  cycle  will  increase  drastically. 

Recently, the multirate FIR/IIR filtering architecture has been proposed [20] [21]. Fig.10 shows the
multirate architecture that realizes a given function $H(z)$, where $H_0(z)$ and $H_1(z)$ are the polyphase components
of $H(z)$, and $\bar{H}(z) = H_0(z) + H_1(z)$. As we can see, the multirate architecture can be readily applied
to very high-speed filtering operations. For example, it can process data at a 100 MHz rate while only 50
MHz processing elements are required. Thus, the aforementioned speed constraint can be resolved at the
algorithmic/architectural level by trading some hardware overhead or chip area. In this section, we will show
that our video co-processor can be easily reconfigured to perform the multirate FIR and IIR (ARMA)
filtering. That is, we can speed up the co-processor on the spot. Moreover, the incorporation of the multirate
DT architecture in [12] is also considered.


3.1  Multirate  FIR  Architecture 

Given an $N$th-order FIR filter

$$
H(z) = \sum_{i=0}^{N} h_i z^{-i}, \tag{30}
$$

its polyphase components are given by [11]

$$
H_0(z^2) = \sum_{i=0}^{N/2} h_{i,0}\, z^{-2i}, \qquad H_1(z^2) = \sum_{i=0}^{N/2} h_{i,1}\, z^{-2i}, \tag{31}
$$

with $h_{i,0} = h_{2i}$ and $h_{i,1} = h_{2i+1}$, for $i = 0, 1, \ldots, N/2$. We also have

$$
\bar{H}(z^2) = H_0(z^2) + H_1(z^2) = \sum_{i=0}^{N/2} \bar{h}_i\, z^{-2i}, \tag{32}
$$

where $\bar{h}_i = h_{i,0} + h_{i,1}$, for $i = 0, 1, \ldots, N/2$.$^2$ The multirate FIR filter is implemented as follows. We
first implement each $(N/2)$th-order FIR subfilter using the FIR lattice structure discussed in Section 2.1. The
resulting architecture is depicted in Fig. 11(a), where $R_i$, $\bar{R}_i$, and $\tilde{R}_i$ denote the $i$th basic modules
used in $H_0(z)$, $\bar{H}(z)$, and $H_1(z)$, respectively. Next, we map Fig. 11(a) to our video co-processor with
the mapping

$$
R_i \to M_{3i}, \qquad \bar{R}_i \to M_{3i+1}, \qquad \tilde{R}_i \to M_{3i+2}, \tag{33}
$$

for $i = 0, 1, \ldots, N/2 - 1$. In addition, the interconnection network is set to Type II for the connections. Fig. 11(b)
illustrates the realization of a 6th-order FIR filter using 9 programmable modules. The detailed settings
of the video co-processor are described in Tables 2 and 3.

$^2$ In Fig. 10, the data sample rate is reduced from $f_s$ to $f_s/2$ after the downsampling circuit. As a consequence, the delay $z^{-2}$
at $f_s$ is equivalent to $z^{-1}$ at $f_s/2$, and we use the notations $H_0(z)$, $\bar{H}(z)$, $H_1(z)$ instead of $H_0(z^2)$, $\bar{H}(z^2)$, $H_1(z^2)$ in Fig. 10.

Once the video co-processor has been initialized, the host processor can send data at the rate $f_s$ to the
downsampling circuit in Fig. 10. The outputs of the downsampling circuit, $x_i(n)$, $i = 0, 1, 2$, are then
processed by the three FIR subfilters in parallel at only the rate $f_s/2$. After the subfilter outputs $y_i(n)$ are
generated, the FIR filtering output $y(n)$ is reconstructed by the upsampling circuit in a fully pipelined
way, and the data rate of $y(n)$ returns to $f_s$.

The only hardware overhead of this multirate operation is the downsampling circuit and the
upsampling circuit in Fig. 10, which pre- and post-process the data. Since we need $N/2$ modules to
implement each subfilter, a total of $3N/2$ modules is used for an $N$th-order FIR filter. That
is, the system throughput is doubled at the expense of a 50% hardware overhead. Nevertheless, this overhead
is handled by simply activating more modules in the array and reconfiguring the interconnection network,
rather than by implementing it explicitly.
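The data flow of the three-subfilter scheme can be illustrated with a short Python sketch. The internals of the downsampling and upsampling circuits of Fig. 10 are not reproduced in the text, so the pre- and post-processing below follow the standard two-branch fast FIR identity, which is an assumption; the function names `convolve` and `multirate_fir` are likewise hypothetical. Each of the three convolutions operates only on half-rate sequences.

```python
def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def multirate_fir(h, x):
    """Filter x with h using three half-rate subfilters: H0, H0 + H1, H1."""
    h = list(h) + [0.0] * (len(h) % 2)       # pad both to even length
    x = list(x) + [0.0] * (len(x) % 2)
    h0, h1 = h[0::2], h[1::2]                # polyphase components of H(z)
    x0, x1 = x[0::2], x[1::2]                # even and odd input samples
    h_sum = [a + b for a, b in zip(h0, h1)]  # coefficients of the sum filter
    x_sum = [a + b for a, b in zip(x0, x1)]  # assumed pre-processing step
    u0 = convolve(h0, x0)                    # three subfilters, all at rate fs/2
    us = convolve(h_sum, x_sum)
    u1 = convolve(h1, x1)
    # Assumed post-processing: y(2n) = u0(n) + u1(n-1), y(2n+1) = us(n) - u0(n) - u1(n).
    y_even = [a + b for a, b in zip(u0 + [0.0], [0.0] + u1)]
    y_odd = [s - a - b for s, a, b in zip(us, u0, u1)] + [0.0]
    y = []
    for e, o in zip(y_even, y_odd):          # interleave back to rate fs
        y += [e, o]
    return y

# Arbitrary test data (not taken from the paper).
h = [1.0, -0.5, 0.25, 0.125, -0.3, 0.2, 0.1]
x = [float(n % 5) - 2.0 for n in range(20)]
y_fast = multirate_fir(h, x)
y_ref = convolve(h, x)
max_err = max(abs(a - b) for a, b in zip(y_ref, y_fast))
print(max_err)
```

The half-rate result agrees with the direct full-rate convolution up to rounding, while each subfilter sees only one sample for every two input samples, which is the source of the speed-up.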


3.2  Multirate IIR Architecture

For  the  IIR  systems,  the  polyphase  decomposition  of  the  transfer  function  is  not  as  straightforward  as  in 
the  FIR  case.  Given  an  IIR  system 

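$$
H(z) = \frac{1 + \sum_{i=1}^{M} p_i z^{-i}}{1 + \sum_{i=1}^{N} q_i z^{-i}} \tag{34}
$$

($M < N$; $M$ and $N$ are even numbers), we first multiply the numerator and the denominator
of the transfer function by $\left(1 + \sum_{i=1}^{N} (-1)^i q_i z^{-i}\right)$. We then have

$$
H(z) = \frac{\left(1 + \sum_{i=1}^{M} p_i z^{-i}\right)\left(1 + \sum_{i=1}^{N} (-1)^i q_i z^{-i}\right)}{1 + \sum_{i=1}^{N} \tilde{q}_i z^{-2i}} = \frac{N(z)}{D(z^2)} = \frac{N_0(z^2) + z^{-1} N_1(z^2)}{D(z^2)}, \tag{35}
$$

where $N_0(z^2)$ and $N_1(z^2)$ are the polyphase components of the new numerator $N(z)$, and the $\tilde{q}_i$'s are given by

$$
\tilde{q}_i =
\begin{cases}
(-1)^i q_i^2 + 2 \sum_{j=0}^{i-1} (-1)^j q_j\, q_{2i-j}, & \text{if } i \le N/2, \\
(-1)^i q_i^2 + 2 \sum_{j=2i-N}^{i-1} (-1)^j q_j\, q_{2i-j}, & \text{if } i > N/2,
\end{cases} \tag{36}
$$

with $q_0 = 1$.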
Thus, the polyphase decomposition of $H(z)$ can be written as

$$
H(z) = H_0(z^2) + z^{-1} H_1(z^2) \tag{37}
$$

with

$$
H_0(z^2) = \frac{N_0(z^2)}{D(z^2)}, \qquad H_1(z^2) = \frac{N_1(z^2)}{D(z^2)}. \tag{38}
$$


Note that the maximum order of $H_0(z)$ and $H_1(z)$ is still $N$; i.e., the orders of the subfilters do not
decrease after the decomposition. This indicates that the use of Fig. 10 will triple the hardware cost. The
implementation of IIR multirate filtering is similar to the FIR case. We first implement each of the
subfilters, $H_0(z)$, $H_1(z)$, and $\bar{H}(z) = H_0(z) + H_1(z)$, using the cascade IIR structure discussed in Section 2.3.
The corresponding parallel architecture is shown in Fig. 12(a), where $\{A_{i,0}(z), A_{i,1}(z)\}$, $\{\bar{A}_{i,0}(z), \bar{A}_{i,1}(z)\}$,
and $\{\tilde{A}_{i,0}(z), \tilde{A}_{i,1}(z)\}$, $i = 0, 1, \ldots, N/2 - 1$, are the subfilters of $H_0(z)$, $\bar{H}(z)$, and $H_1(z)$, respectively. It
can then be mapped to our co-processor architecture by

$$
\begin{aligned}
A_{i,0}(z) &\to M_{6i}, & \bar{A}_{i,0}(z) &\to M_{6i+2}, & \tilde{A}_{i,0}(z) &\to M_{6i+4}, \\
A_{i,1}(z) &\to M_{6i+1}, & \bar{A}_{i,1}(z) &\to M_{6i+3}, & \tilde{A}_{i,1}(z) &\to M_{6i+5},
\end{aligned} \tag{39}
$$

for $i = 0, 1, \ldots, N/2 - 1$, so that each value of $i$ occupies six consecutive modules. Fig. 12(b) demonstrates
the multirate 4th-order IIR architecture using 12 programmable modules. The detailed settings of the modules
and the interconnection network can be found in Tables 2 and 3. In general, we need $3N$ modules to realize
an $N$th-order multirate IIR filter.


3.3  Multirate  Discrete  Transform  Architecture 

The multirate DT architecture was discussed in [12], where the multirate computation of the DT is
applied to the low-power VLSI implementation of transform-coding kernels. Since the multirate DT in [12]
is derived from the second-order IIR filter in direct form, it is not applicable to the programmable architecture
proposed here. We therefore derive a rotation-based multirate DT architecture that can be incorporated
into our co-processor design.

Splitting the input data sequence $x(n)$, $n = 0, 1, \ldots, L-1$, into the even sequence

$$
x_e(n) = x(2n), \quad n = 0, 1, \ldots, L/2 - 1, \tag{40}
$$

and the odd sequence

$$
x_o(n) = x(2n+1), \quad n = 0, 1, \ldots, L/2 - 1, \tag{41}
$$

(16) and (17) can be rewritten as

$$
X_c(k) = \beta \sum_{n=0}^{L/2-1} \cos[(4n+1)\omega_k + \eta_k]\, x_e(n) + \beta \sum_{n=0}^{L/2-1} \cos[(4n+3)\omega_k + \eta_k]\, x_o(n) = X_{c,e}(k) + X_{c,o}(k), \tag{42}
$$

$$
X_s(k) = \beta \sum_{n=0}^{L/2-1} \sin[(4n+1)\omega_k + \eta_k]\, x_e(n) + \beta \sum_{n=0}^{L/2-1} \sin[(4n+3)\omega_k + \eta_k]\, x_o(n) = X_{s,e}(k) + X_{s,o}(k), \tag{43}
$$

for $k = 0, 1, \ldots, N-1$. Following the derivations in [7][8], it can be shown that the rotation-based
module in Fig. 6(a) can be used for the dual generation of $X_{c,e}(k)$ and $X_{s,e}(k)$ by setting

$$
\begin{bmatrix} f_{0,k} \\ f_{1,k} \end{bmatrix} =
\begin{bmatrix} \beta \cos((2L+1)\omega_k + \eta_k) \\ \beta \sin((2L+1)\omega_k + \eta_k) \end{bmatrix},
\qquad
\mathbf{R}_{k,e} =
\begin{bmatrix} \cos(4\omega_k) & \sin(4\omega_k) \\ -\sin(4\omega_k) & \cos(4\omega_k) \end{bmatrix}. \tag{44}
$$

Similarly, the same module can be used to obtain $X_{c,o}(k)$ and $X_{s,o}(k)$ with the setting

$$
\begin{bmatrix} f_{0,k} \\ f_{1,k} \end{bmatrix} =
\begin{bmatrix} \beta \cos((2L+3)\omega_k + \eta_k) \\ \beta \sin((2L+3)\omega_k + \eta_k) \end{bmatrix},
\qquad
\mathbf{R}_{k,o} =
\begin{bmatrix} \cos(4\omega_k) & \sin(4\omega_k) \\ -\sin(4\omega_k) & \cos(4\omega_k) \end{bmatrix}. \tag{45}
$$


The parallel architecture that realizes (42)-(45) is depicted in Fig. 13(a). The input data sequence $x(n)$ is
first decimated into $x_e(n)$ and $x_o(n)$ by the decimator (the extra delay element on the top
synchronizes the two output sequences so that they arrive at the two rotation modules at the same
time). Then $X_{c,e}(k)$, $X_{s,e}(k)$, $X_{c,o}(k)$, and $X_{s,o}(k)$ are generated by the two modules in parallel, and the
outputs are summed to obtain $X_c(k)$ and $X_s(k)$.

The multirate DT architecture can be mapped to the co-processor design by assigning the parameters
$\mathbf{f}_{k,e}$, $\mathbf{R}_{k,e}$ to module $M_{2k}$ and $\mathbf{f}_{k,o}$, $\mathbf{R}_{k,o}$ to $M_{2k+1}$, for $k = 0, 1, \ldots, N/2 - 1$. Fig. 13(b) shows the multirate
4-point DHT architecture based on 8 programmable modules. There are two parts inside the interconnection
network: one is the summation circuit that combines the even and odd outputs of the array; the other is
the circuit that performs the combination function defined in Table 1. In a real implementation, we can either
add one summation circuit so that the switch settings for the DT in Table 3 can still be used, or
define new switch settings by merging these two circuits. The hardware overhead of the
multirate DT is a doubling of the complexity.
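The even/odd split in (42)-(43) can be checked numerically with the short Python sketch below. It assumes that (16)-(17), which are not reproduced in this section, have the full-rate form $X_c(k) = \beta \sum_{n=0}^{L-1} \cos[(2n+1)\omega_k + \eta_k]\,x(n)$ and the corresponding sine sum; this form is inferred from (40)-(43). The DCT-like choice $\omega_k = k\pi/(2L)$, $\eta_k = 0$ and all function names are hypothetical, and the sums are evaluated directly rather than with the rotation-based module.

```python
import math

def dt_full_rate(x, omega, eta, beta=1.0):
    """X_c and X_s as single sums over all L samples (assumed form of (16)-(17))."""
    xc = beta * sum(math.cos((2 * n + 1) * omega + eta) * v for n, v in enumerate(x))
    xs = beta * sum(math.sin((2 * n + 1) * omega + eta) * v for n, v in enumerate(x))
    return xc, xs

def dt_multirate(x, omega, eta, beta=1.0):
    """The same quantities from the even/odd half-rate sums of (42)-(43)."""
    x_even, x_odd = x[0::2], x[1::2]     # sequences (40) and (41)
    xc_e = beta * sum(math.cos((4 * n + 1) * omega + eta) * v for n, v in enumerate(x_even))
    xs_e = beta * sum(math.sin((4 * n + 1) * omega + eta) * v for n, v in enumerate(x_even))
    xc_o = beta * sum(math.cos((4 * n + 3) * omega + eta) * v for n, v in enumerate(x_odd))
    xs_o = beta * sum(math.sin((4 * n + 3) * omega + eta) * v for n, v in enumerate(x_odd))
    return xc_e + xc_o, xs_e + xs_o      # summation circuit

L = 8
x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]   # arbitrary test block
max_err = 0.0
for k in range(L):
    omega_k = k * math.pi / (2 * L)      # hypothetical DCT-like setting, eta_k = 0
    full = dt_full_rate(x, omega_k, 0.0)
    half = dt_multirate(x, omega_k, 0.0)
    max_err = max(max_err, abs(full[0] - half[0]), abs(full[1] - half[1]))
print(max_err)
```

Each of the two half-rate branches processes only $L/2$ samples per block, which is why the two modules can run at half the input rate.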


3.4  Design  Examples  Using  Multirate  Operations 

3.4.1  Multirate  FIR  Filtering 

Given  the  FIR  transfer  function  in  (26),  we  first  perform  the  polyphase  decomposition  which  yields 

$$
\begin{aligned}
H_0(z) &= 1 - 0.1327z^{-1} + 0.5328z^{-2} + 0.1038z^{-3} + 0.2195z^{-4}, \\
H_1(z) &= -0.8843 - 1.1219z^{-1} - 0.8882z^{-2} - 0.3786z^{-3} - 0.1094z^{-4}, \\
\bar{H}(z) &= 0.1157 - 1.2546z^{-1} - 0.3554z^{-2} - 0.2748z^{-3} + 0.1101z^{-4}.
\end{aligned} \tag{46}
$$

Then we can follow the steps in (1)-(2) and (4)-(7) to find the parameters for each filter in (46). The results
are listed in Table 5(a).
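A small Python sketch can confirm that the third filter in (46) is the coefficient-wise sum of the first two, as (32) requires, and show how the two polyphase components interleave into a full-rate impulse response according to (31). The transfer function (26) itself is not reproduced here, so the interleaved list is only what (31) implies and is not a quotation of (26); the name `interleave` is illustrative.

```python
# Coefficients of the three subfilters as listed in (46).
H0 = [1.0, -0.1327, 0.5328, 0.1038, 0.2195]
H1 = [-0.8843, -1.1219, -0.8882, -0.3786, -0.1094]
H_SUM = [0.1157, -1.2546, -0.3554, -0.2748, 0.1101]

def interleave(h0, h1):
    """Rebuild h(n) from its polyphase components: h(2i) = h0(i), h(2i+1) = h1(i)."""
    h = []
    for a, b in zip(h0, h1):
        h += [a, b]
    return h

# Largest deviation between H0 + H1 and the listed sum filter.
max_dev = max(abs(a + b - s) for a, b, s in zip(H0, H1, H_SUM))
h_full = interleave(H0, H1)
print(max_dev)
print(h_full)
```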


3.4.2  Multirate  IIR  (ARMA)  Filtering 

Consider  the  IIR  (ARMA)  filter  shown  below 


$$
H(z) = \frac{1 - 0.4000z^{-1} + 0.1600z^{-2}}{1 - 1.8192z^{-1} + 2.0598z^{-2} - 1.1248z^{-3} + 0.3422z^{-4}}. \tag{47}
$$

We can find its polyphase components from (35)-(36) as

$$
\begin{aligned}
H_0(z) &= \frac{1 + 1.4921z^{-1} + 0.2219z^{-2} + 0.0548z^{-3}}{1 + 0.8100z^{-1} + 0.8346z^{-2} + 0.1446z^{-3} + 0.1171z^{-4}}, \\
H_1(z) &= \frac{1.4192 + 0.5920z^{-1} + 0.0431z^{-2}}{1 + 0.8100z^{-1} + 0.8346z^{-2} + 0.1446z^{-3} + 0.1171z^{-4}}, \\
\bar{H}(z) &= \frac{2.4192 + 2.0841z^{-1} + 0.2650z^{-2} + 0.0548z^{-3}}{1 + 0.8100z^{-1} + 0.8346z^{-2} + 0.1446z^{-3} + 0.1171z^{-4}}.
\end{aligned} \tag{48}
$$

The necessary parameters of each ARMA filter in (48) can then be computed from (49)-(56) in the Appendix.
The parameter settings for the programmable module are listed in Table 5(b).
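The decomposition of this example can be reproduced with a few lines of Python. The sketch below follows the procedure of (35)-(38): multiply the numerator and the denominator by the denominator with alternating signs, then split the results into even and odd powers. The function names `poly_mul` and `iir_polyphase` are illustrative only.

```python
def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def iir_polyphase(p, q):
    """Return (N0, N1, D) with H(z) = [N0(z^2) + z^-1 N1(z^2)] / D(z^2).

    p and q hold the numerator and denominator coefficients of H(z)
    in powers of z^-1, with p[0] = q[0] = 1.
    """
    q_alt = [(-1) ** i * qi for i, qi in enumerate(q)]  # denominator evaluated at -z
    num = poly_mul(p, q_alt)        # new numerator N(z)
    den = poly_mul(q, q_alt)        # odd powers cancel, leaving D(z^2)
    return num[0::2], num[1::2], den[0::2]

# Coefficients of the example filter (47).
p = [1.0, -0.4000, 0.1600]
q = [1.0, -1.8192, 2.0598, -1.1248, 0.3422]
n0, n1, d = iir_polyphase(p, q)
n1_padded = n1 + [0.0] * (len(n0) - len(n1))
n_sum = [a + b for a, b in zip(n0, n1_padded)]   # numerator of the sum filter
for name, coeffs in (("N0", n0), ("N1", n1), ("N0+N1", n_sum), ("D", d)):
    print(name, [round(c, 4) for c in coeffs])
```

The computed coefficients agree with (48) to within rounding in the fourth decimal place.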


3.4.3  Multirate  8-point  DCT 

The rotation parameters $\theta_i$'s and the scaling factors $f_{0,i}$'s, $f_{1,i}$'s for the modules operating on the even and
odd subsequences can be found by using (44) and (45), respectively. See the settings in Table 5(c).


4  Incorporation  of  the  QRD-LSL  Architecture 

In  the  last  part,  we  will  incorporate  the  feature  of  adaptive  filtering  into  the  co-processor  design.  We  will 
show  that,  with  little  modification  of  the  programmable  module  design,  the  proposed  co-processor  can  also 
perform  the  QR-decomposition  based  recursive  least-squares  lattice  (QRD-LSL)  algorithm  [9]. 

4.1  CORDIC  Operation  and  QRD-LSL  Architecture 

The CORDIC processor is capable of evaluating various rotation functions using shift-and-add operations [5].
It has two operation modes. The first is the vector rotation mode (see Fig. 14(a)),
which rotates the 2-input vector by a given angle $\theta$. Let $W$ be the total number of iterations in the CORDIC
algorithm. In a real implementation, the rotation is performed by feeding a sequence of $\pm 1$'s, $\mu_i$, $i = 0, 1, \ldots, W$,
to the CORDIC processor. Suppose that the rotation circuit inside our programmable module is implemented
using the CORDIC processor. The values of the $\mu_{in}$'s can then be calculated in advance and loaded into the
module array during the initialization mode. The second is the angle accumulation mode
(see Fig. 14(b)), which rotates the input vector until one of the input components is annihilated; meanwhile, the
$\mu_{out}$ sequence that reflects the performed rotation is generated. In adaptive filtering applications, we
use this mode to update the RLS parameters.
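The two CORDIC modes can be sketched in floating-point Python as follows. This is a minimal illustration rather than the fixed-point shift-and-add datapath of the paper: the iteration count `W = 16`, the explicit division by the CORDIC gain, the rotation direction convention, and the names `mu_from_angle`, `rotate`, and `annihilate` are all assumptions made for the example.

```python
import math

W = 16                                              # assumed number of iterations
ANGLES = [math.atan(2.0 ** -i) for i in range(W)]   # elementary rotation angles
GAIN = 1.0
for i in range(W):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))        # accumulated CORDIC gain

def mu_from_angle(theta):
    """Precompute the +/-1 sequence for a rotation by theta (about |theta| < 1.74 rad)."""
    mu, residual = [], theta
    for a in ANGLES:
        s = 1 if residual >= 0 else -1
        mu.append(s)
        residual -= s * a
    return mu

def rotate(x, y, mu):
    """Vector rotation mode: apply the micro-rotations selected by mu."""
    for i, s in enumerate(mu):
        x, y = x - s * y * 2.0 ** -i, y + s * x * 2.0 ** -i
    return x / GAIN, y / GAIN

def annihilate(x, y):
    """Angle accumulation mode: drive y to zero (x > 0) and emit the mu sequence."""
    mu = []
    for i in range(W):
        s = -1 if y >= 0 else 1                     # always rotate toward the x axis
        mu.append(s)
        x, y = x - s * y * 2.0 ** -i, y + s * x * 2.0 ** -i
    return x / GAIN, y / GAIN, mu

# Rotation mode with a precomputed sequence: rotate (1, 0) by 60 degrees.
xr, yr = rotate(1.0, 0.0, mu_from_angle(math.pi / 3))
# Accumulation mode: annihilate the second component of (3, 4) ...
xa, ya, mu_out = annihilate(3.0, 4.0)
# ... and reuse the generated sequence to apply the same rotation to another vector.
u, v = rotate(0.0, 1.0, mu_out)
print(xr, yr, xa, ya, u, v)
```

Passing `mu_out` from one processor to `rotate` in another mirrors the way the generated sequence is handed from a processor in angle accumulation mode to processors in vector rotation mode.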

The QRD-LSL algorithm is one of the most promising candidates for implementing recursive
least-squares (RLS) adaptive filtering. Fig. 15(a) shows the overall architecture for linear prediction.
Readers may refer to [9] and [13, Chap. 18] for the detailed operations. The QRD-LSL can be implemented
with CORDIC processors by replacing each angle computer with a CORDIC in angle accumulation mode
($R_a(\theta)$) and each rotator with a CORDIC in vector rotation mode ($R_r(\theta)$). The resulting system
is shown in Fig. 15(b), where the dashed lines denote the data paths for the $\mu$ sequences. The $\mu$ sequences
are first computed by the $R_a(\theta)$'s using the forward and backward signals at each stage. The
generated $\mu$ sequences are then sent to the $R_r(\theta)$'s to rotate the signals at each stage. The CORDIC-based
QRD-LSL can be considered a special case of the CORDIC-based QRD-RLS array discussed in [5].

4.2  Mapping  QRD-LSL  to  the  Programmable  Video  Co-processor 

From Fig. 15, we observe that the basic modules used in the QRD-LSL are very similar to our programmable
module, and the connections can be easily handled by the interconnection network. We thus modify the
programmable module by adding one direct path and one more switch for selecting this new path.
In addition, one input port for $\mu_{in}$ and one output port for $\mu_{out}$ are added for the propagation
of the rotation parameters (see Fig. 16). Using the new programmable module, we can implement the
QRD-LSL in a very straightforward way. The detailed settings of the module array and the connection
type can be found in Tables 2 and 3. Fig. 17 shows the implementation of a 4th-order QRD-LSL based on
our programmable co-processor, where the adaptive filtering is performed in a fully pipelined way without
any feedback path.

5  Conclusions 

In this paper, a programmable video co-processor design for numerically intensive front-end video/image
communications is presented. The proposed parallel architecture can perform various functions (FIR/QMF/
IIR/DT/QRD-LSL) for the host processor by simply loading the suitable parameters and reconfiguring
the interconnection network. The parallel design and the fully pipelined operation of the co-processor
architecture retain all the advantages of an ASIC design while being much more flexible. Moreover, the architecture
is regular and modular and is therefore very suitable for VLSI implementation.

We also showed that the video co-processor can be reconfigured to perform multirate FIR/IIR/DT operations
with only two additional small circuits for the down- and up-sampling operations. The significance of this
feature is twofold. First, it speeds up the co-processor, since the processing elements
can now handle input data arriving at twice their own processing rate. Second, the multirate operation can be
applied to low-power implementation as discussed in [12]. In most low-power VLSI designs, the supply
voltage is reduced to lower the total power consumption; however, the device speed degrades
as the supply voltage goes down. In [12], it has been shown that the multirate scheme can “compensate” for
the increased delay caused by the low supply voltage without hindering the system performance. Thus,
the co-processor can have a switch for the supply voltage. Under normal operation, the supply voltage
is 5 V. When the job is not computationally demanding, the supply voltage is switched to 3.1 V and the
co-processor is switched to multirate mode. The system still maintains the processing speed even though
each processing element inside the module has been slowed down by the reduced voltage.

Another interesting application is to incorporate the DCT-based motion estimation (ME) scheme [22][10]
into our co-processor design. Since the DCT-based ME requires the DCT/DST as its basic processing kernel,
and its computations are inherently local, the programmable co-processor can be modified
to perform ME.


Appendix 

Conversion of Parameters for the Second-Order IIR Lattice Filter
For $i = 0, 1, \ldots, N/2 - 1$,

1. Find the poles of $H_i(z)$:

$$
p_{0,i} = \frac{-a_i + \sqrt{a_i^2 - 4b_i}}{2}, \qquad p_{1,i} = \frac{-a_i - \sqrt{a_i^2 - 4b_i}}{2}. \tag{49}
$$


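2. (a) For the case $a_i^2 - 4b_i < 0$ (complex conjugate poles at $r_i e^{\pm j\theta_i}$), compute the radius $r_i$ and
phase $\theta_i$ of the poles:

$$
r_i = \mathrm{mag}(p_{0,i}), \qquad \theta_i = \arg(p_{0,i}). \tag{50}
$$

(b) For the case $a_i^2 - 4b_i > 0$ (two real poles at $p_{0,i}$ and $p_{1,i}$), compute $r_i$ and $\theta_i$ by equating

$$
2 r_i \cos\theta_i = p_{0,i} + p_{1,i}, \qquad r_i^2 = p_{0,i} \cdot p_{1,i}, \tag{51}
$$

which yields

$$
r_i = \sqrt{p_{0,i} \cdot p_{1,i}}, \qquad \theta_i = \cos^{-1} \frac{p_{0,i} + p_{1,i}}{2\sqrt{p_{0,i} \cdot p_{1,i}}}. \tag{52}
$$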


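3. Solve for $k_0$ and $k_1$ used in $A_{i,0}(z)$ by setting

$$
\begin{cases}
r_i (k_0 \cos\theta_i + k_1 \sin\theta_i) = 1, \\
-r_i^2 k_0 = 0,
\end{cases} \tag{53}
$$

which yields

$$
k_0 = 0, \qquad k_1 = 1/(r_i \sin\theta_i). \tag{54}
$$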
4. Solve for $k_0$ and $k_1$ used in $A_{i,1}(z)$ by setting

$$
\begin{cases}
r_i (k_0 \cos\theta_i + k_1 \sin\theta_i) = c_i, \\
-r_i^2 k_0 = d_i,
\end{cases} \tag{55}
$$

which yields

$$
k_0 = -d_i / r_i^2, \qquad k_1 = (c_i / r_i - k_0 \cos\theta_i) / \sin\theta_i. \tag{56}
$$

end.


All $r_i$'s should be less than one to ensure the stability of the IIR filtering. There are some limitations in
this realization. First, we cannot realize a second-order IIR section that has a repeated real pole or two
real poles with opposite signs ($r_i$ in (51) cannot be solved). In some cases, this situation can be avoided by
pairing real poles of the same sign or by imposing such constraints in the design phase of
the filter. Second, the order of the ARMA filter is restricted to be even to facilitate the decomposition in
(12).
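Steps 1 to 4 can be sketched in Python for the complex-conjugate pole case. The sketch assumes that each second-order section has the form $H_i(z) = (1 + c_i z^{-1} + d_i z^{-2}) / (1 + a_i z^{-1} + b_i z^{-2})$, which is inferred from (49) and (55) because the defining equation (12) is not reproduced here. The function name `second_order_params` is hypothetical, and the real-pole case 2(b) is deliberately left out.

```python
import cmath
import math

def second_order_params(a, b, c, d):
    """Steps 1-4 of the Appendix for one section with complex-conjugate poles.

    Assumed section form (not quoted from the paper):
        H_i(z) = (1 + c z^-1 + d z^-2) / (1 + a z^-1 + b z^-2)
    Returns r, theta, (k0, k1) for A_{i,0}(z) and (k0, k1) for A_{i,1}(z).
    """
    disc = a * a - 4.0 * b
    if disc >= 0:
        raise ValueError("real poles: case 2(b) is not covered by this sketch")
    p0 = (-a + cmath.sqrt(disc)) / 2.0               # (49), pole in the upper half plane
    r, theta = abs(p0), cmath.phase(p0)              # (50)
    k_a0 = (0.0, 1.0 / (r * math.sin(theta)))        # (54)
    k0 = -d / (r * r)                                # (56)
    k1 = (c / r - k0 * math.cos(theta)) / math.sin(theta)
    return r, theta, k_a0, (k0, k1)

# First second-order section of the cascade form (29).
a, b, c, d = -0.9192, 0.4225, -1.1314, 0.6400
r, theta, k_a0, k_a1 = second_order_params(a, b, c, d)
print(r, math.degrees(theta), k_a0, k_a1)
```

For this section the pole radius is below one, consistent with the stability requirement stated above.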
Table 1: Parameter settings for the unified discrete transformation architecture, where $\mathrm{Re}\{X_{DFT}(k)\}$ and
$\mathrm{Im}\{X_{DFT}(k)\}$ denote the real part and the imaginary part of the DFT, respectively.

Table 3: Switch settings for the interconnection network.

Table 4: Settings for the (a) FIR filter, (b) QMF filter, (c) IIR (ARMA) filter, and (d) 8-point DCT.

Table 5: Settings for the (a) FIR filter, (b) IIR (ARMA) filter, and (c) 8-point DCT under multirate
operation.

Figure 1: (a) Basic lattice filter section with $|k_j| < 1$. (b) Basic lattice filter section with $|k_j| > 1$. (c) Lattice
FIR structure.

Figure 3: (a) All-pole IIR lattice. (b) General IIR (ARMA) lattice.

Figure 4: Second-order IIR lattice architecture.

Figure 5: IIR (ARMA) structure based on the second-order IIR lattice module.

Figure 6: SIPO MLT architecture: (a) Rotation-based module for the dual generation of $X_c(k)$ and $X_s(k)$,
where the downsampling operation $\downarrow L$ at the right end denotes that we pick up the values of $X_c(k)$
and $X_s(k)$ at the $L$th clock cycle and ignore all the previous intermediate results. (b) Overall transform
architecture.

Figure 7: (a) Programmable module for the FIR/QMF/IIR/DT. (b) Switches used in the module ($S_i$, $i = 0, 1, 2, 3$;
$S_4$; $S_5$; each with states 0 and 1).

Figure 8: Overall architecture for FIR filtering (blocks: unified module array, interconnection network).

Figure 9: Overall architecture for IIR (ARMA) filtering (blocks: programmable module array, interconnection network).

Figure 10: Multirate filtering architecture, where $f_s$ is the data sample rate.

Figure 11: (a) Multirate FIR based on the lattice structure. (b) Mapping part (a) to the co-processor
architecture: the figure demonstrates the multirate 6th-order FIR architecture using 9 programmable modules.

Figure 12: (a) Multirate IIR based on the lattice structure discussed in Section 2.3. (b) Mapping part (a)
to the co-processor architecture: the figure demonstrates the multirate 4th-order IIR architecture using 12
programmable modules.

Figure 13: (a) Multirate architecture for the dual generation of $X_c(k)$ and $X_s(k)$. (b) Multirate 4-point
DHT architecture based on 8 programmable modules.

Figure 15: (a) QRD-LSL structure. (b) Realizing the QRD-LSL using the CORDIC processor.

Figure 16: (a) New programmable module with the QRD-LSL feature. (b) Switches used in the module ($S_i$,
$i = 0, 1, 2, 3, 6$; $S_4$; $S_5$; each with states 0 and 1).

Figure 17: Realizing the QRD-LSL using the programmable video co-processor.


